Exploratory Data Analysis of White Wine Quality by HINDHUJA GUTHA

Introduction:

The dataset used in this EDA consists of white wine samples of the Portuguese “Vinho Verde” wine. For more details, consult http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].

Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

The inputs include objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines.

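Attribute Information:

Input variables (based on physicochemical tests):

  • fixed.acidity
  • volatile.acidity
  • citric.acid
  • residual.sugar
  • chlorides
  • free.sulfur.dioxide
  • total.sulfur.dioxide
  • density
  • pH
  • sulphates
  • alcohol

Output variable (based on sensory data):

  • quality (score between 0 and 10)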

Univariate Plots Section

This section summarises every variable in the dataset, shows histograms for the important variables and, where necessary, creates new variables.

White Wine Dataset Summary

Null values in Dataset

## [1] 0

Row count

## [1] 4898

Column count

## [1] 13

Dataset Summary

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
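The summary above reports the same six statistics for every column: minimum, first quartile, median, mean, third quartile and maximum. The `##`-prefixed output suggests the report was produced in R; the following is only an illustrative Python sketch of how such a summary can be computed. The function name `six_number_summary` and the sample values are assumptions, and `method="inclusive"` is one possible quantile convention that is not confirmed to match the one used for this report.

```python
import statistics

def six_number_summary(values):
    """Return min, quartiles, median, mean and max for a list of numbers."""
    # quantiles(n=4) returns the three cut points Q1, median and Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.mean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }

# Hypothetical values, not taken from the dataset.
summary = six_number_summary([1, 2, 3, 4, 5])
print(summary)
```

Comparing the mean with the median in such a summary gives a quick hint of skew: a mean well above the median, as for residual.sugar (6.391 against 5.200), points to a long right tail.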

Dataset Observations:

idealpH categorical variable

Based on the information that 3-3.4 is the best pH level for white wines, a categorical variable idealpH is created. It takes the value ‘Yes’ when the pH level is between 3 and 3.4, and ‘No’ otherwise.
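The rule for this variable can be sketched as follows. This is an illustrative Python sketch, not the code behind this report (whose output appears to come from R). The function name `ideal_ph` and the sample pH readings are assumptions, and the range is treated as inclusive at both ends, which the text does not specify.

```python
def ideal_ph(ph, low=3.0, high=3.4):
    """Return 'Yes' when ph lies in [low, high], otherwise 'No'."""
    return "Yes" if low <= ph <= high else "No"

# Hypothetical pH readings, not taken from the dataset.
sample_ph = [2.72, 3.0, 3.18, 3.4, 3.82]
labels = [ideal_ph(ph) for ph in sample_ph]
print(labels)  # ['No', 'Yes', 'Yes', 'Yes', 'No']
```

Keeping the bounds as parameters makes it easy to see how sensitive the Yes/No split is to the chosen range.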

Fixed Acidity Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Fixed Acidity plot

Volatile Acidity Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Volatile Acidity plot

Citric Acid Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric Acid plot

Residual Sugar Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Residual Sugar plot

Total Sulfur Dioxide Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Total Sulfur Dioxide plot

Density Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density plot

pH Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH plot

idealpH categorical variable (pH between 3 and 3.4) Summary

##   No  Yes 
##  834 4064

Bar plot for idealpH variable

Sulphates Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates plot

Alcohol (% by volume) Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol (% by volume) plot

Quality Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Quality plot

Univariate Analysis

Number of instances in the white wine dataset: 4898.

Number of attributes: 13 columns in total. Column “X” identifies the sample, and the remaining 12 columns hold the sample attributes.

Missing Attribute Values: None

What is the structure of your dataset?

The dataset is tidy and has no missing values.

What is/are the main feature(s) of interest in your dataset?

The main attributes are residual.sugar, alcohol, pH and fixed.acidity.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Quality, sulphates and density can help in understanding more about the wine.

Did you create any new variables from existing variables in the dataset?

I created the categorical variable idealpH, based on the ideal pH range of 3-3.4.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

  • Most of the individual variables appear approximately normally distributed.
  • The residual.sugar distribution is skewed.
  • fixed.acidity, volatile.acidity, citric.acid, total.sulfur.dioxide, density and residual.sugar have some outliers.
  • More than 80% of the samples (4064 of 4898) are in the ideal pH range.
  • Not all quality levels are present: the scores in the dataset range from 3 to 9.
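One common way to check for outliers such as those noted above is Tukey's 1.5 × IQR rule, applied here to the quartiles printed in the summaries. This is a Python sketch only: the report does not say which outlier criterion was used, so the rule itself and the function name `tukey_fences` are assumptions.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of Tukey's k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# (min, 1st quartile, 3rd quartile, max) copied from the summaries in this report.
summaries = {
    "fixed.acidity": (3.8, 6.3, 7.3, 14.2),
    "residual.sugar": (0.6, 1.7, 9.9, 65.8),
    "total.sulfur.dioxide": (9.0, 108.0, 167.0, 440.0),
}

flags = {}
for name, (lo, q1, q3, hi) in summaries.items():
    lower, upper = tukey_fences(q1, q3)
    # The variable has outliers under this rule if its extremes cross a fence.
    flags[name] = {"low": lo < lower, "high": hi > upper}
    print(name, round(lower, 2), round(upper, 2), flags[name])
```

Under this rule all three variables have at least one value above the upper fence, and residual.sugar has none below the lower fence because that fence is negative.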

Bivariate Plots Section

Building on the analysis of individual variables above, this section uses bivariate analysis to show comparisons and trends between two variables. Scatterplots are a good way to analyse bivariate relationships, so they are used here. Plots are analysed for the pairs below.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  • There is a negative association between fixed.acidity and pH.
  • There is a positive association between sulphates and pH.
  • The trend rises and falls several times for fixed.acidity and sulphates.
  • Most of the samples have residual sugar below 20 grams, with some outliers.
  • Alcohol and quality seem to have a positive correlation.
  • The majority of samples have an alcohol level above 10% and are in the ideal pH range.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  • Total sulfur dioxide is used to determine the freshness of wine. The majority of samples have total sulfur dioxide above 100 (the first quartile is 108), which suggests that most of the wine samples are not aged well.

What was the strongest relationship you found?

  • The strongest relationships were found between fixed.acidity and pH, and between sulphates and pH.
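The direction and strength of associations like those listed in this section can be quantified with Pearson's correlation coefficient. The following is a minimal Python sketch with made-up values. The report does not state which measure, if any, was computed, so the function `pearson_r` and the sample pairs are assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (fixed.acidity, pH) pairs, not taken from the dataset.
acidity = [6.0, 6.5, 7.0, 7.5, 8.0]
ph = [3.40, 3.30, 3.25, 3.10, 3.05]
r = pearson_r(acidity, ph)
print(round(r, 3))
```

A value close to -1 or +1 indicates a strong linear association and a value near 0 a weak one. The sketch does not handle a constant sequence, for which the coefficient is undefined.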

Multivariate Plots Section

This section explores the association between multiple variables. Based on the analysis of the bivariate plots, the variables below are analysed together.

Relationship between Fixed Acidity, Sulphates and pH

Relationship between Alcohol, Residual Sugar and Quality

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

  • The majority of samples have fixed.acidity above 6.
  • The distribution of pH values with respect to fixed.acidity and sulphates is normal, with some outliers. This is further supported by the bivariate analysis of pH against sulphates and pH against fixed.acidity.

Were there any interesting or surprising interactions between features?

  • Lower pH values occur when fixed.acidity is higher, and higher pH values occur when fixed.acidity is lower.
  • Higher-quality samples are seen when alcohol is between 8.5% and 10%.

Final Plots and Summary

Below are 3 plots with most interesting findings

Plot One

Description One

The ideal pH range for white wine is 3-3.4. The plot above shows that more than 80% of the samples are in the ideal pH range.
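The share quoted here follows directly from the counts in the idealpH summary. A short Python sketch of the arithmetic (the variable names are illustrative):

```python
# Counts copied from the idealpH summary in this report.
counts = {"No": 834, "Yes": 4064}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}
print(total, {label: round(pct, 1) for label, pct in shares.items()})
```

The counts add up to the 4898 rows of the dataset, and the ‘Yes’ share comes to about 83%.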

Plot Two

Description Two

From plots 1 and 2 we can confirm that the majority of samples in the ideal pH range have quality levels between 5 and 7. Quality has some positive association with the ideal pH range.

Plot Three

Description Three

The plot above confirms that pH is lower when fixed.acidity is higher, and vice versa. A relationship can likewise be seen between sulphates and pH. This strengthens the findings of the individual bivariate analyses involving pH.

Reflection

This is a very tidy dataset, which made the univariate analysis easy. I found trends between fixed acidity and pH, and between sulphates and pH. An interesting finding is that the majority of samples have total.sulfur.dioxide above 100. For the bivariate analysis I could not initially work out which were the main attributes and which were the supporting ones, which resulted in some rework. After some research on white wine I was able to determine them, which suggests that prior knowledge of the dataset is required to analyse it properly. Some of the main attributes of wine, such as age, tannins and grape types, are not included in the dataset; they would have helped in understanding more about quality.

References